Britain surprised the world on June 23rd when it voted to leave the EU.
I was doubly surprised that day. For one, I had been tracking Twitter on and off on the topic, and was surprised to see how much more often VoteLeave accounts got retweeted. For another, like most people, I was surprised by the actual voting results.
After the shock came curiosity. Can tweets act as a poll, anyway? And how accurate would that be?
My streaming started on 6.21 and ran on and off until the morning of 6.23 CDT. I wish I had kept the stream running the whole time, but I was running between meetings and didn’t always remember to start the streaming script when I was at my desk. Still, I ended up streaming 11 segments tracking “#Euref” and “#Brexit” (about 0.5-1 hour per segment), and collected about half a million tweets written in English.
Not surprisingly, most of the tweets are from the UK (aside from the gigantic “Other” category).
Before getting into this part, I filtered out tweets that didn’t have the #EURef hashtag, to avoid biases I might pick up through the #Brexit tag. This left me with ~130K tweets, still a good number to work with.
Among all these tweets, 68.56 percent are retweets. Who are people retweeting, you ask? The plot below shows the Twitter accounts that were most frequently retweeted. At a quick glance, I was shocked to see how many retweets LeaveEUOfficial had received. After a little googling on who is who in the Remain and Leave campaigns, I can almost conclude that the Leave campaign held some solid ground on Twitter.
Of course, Simpsons cartoons were well retweeted, and (unsurprisingly) major publishers scored their share of social media coverage, too.
Frequently retweeted Twitter screen names
Quick note: if, like me, you are an outsider to British politics, the Leave supporters in the chart include LouiseMensch, NadineDorriesMP, and theordinaryman2; Remain supporters include Snowden and bengoldacre. For a complete list, see this BBC page.
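For the curious, the retweet share and the most-retweeted accounts can be computed along these lines. This is only a minimal sketch, not the script I ran: it assumes retweets can be recognized by the conventional "RT @name:" prefix in the tweet text, and the function name retweet_summary is made up for illustration.

```python
import re
from collections import Counter

# Assumption: a retweet's text starts with "RT @screen_name:".
RT_RE = re.compile(r"^RT @(\w+):")

def retweet_summary(texts):
    """Return (percent of tweets that are retweets, [(account, count), ...])."""
    sources = Counter()
    for text in texts:
        match = RT_RE.match(text)
        if match:
            sources[match.group(1)] += 1
    total = len(texts)
    share = 100 * sum(sources.values()) / total if total else 0.0
    # most_common() sorts accounts from most to least retweeted
    return share, sources.most_common()
```

Feeding it the list of tweet texts gives back the percentage of retweets plus a ranked list of accounts, which is all a bar chart like the one above needs.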
So, if we conducted a poll solely using Twitter hashtags, would the result resemble the above? Here’s a “Twitter hashtag poll” using some popular hashtags.
For “Leave” tags, I used #leave, #voteleave, #leaveEU, #strongerout, #takecontrol;
For “Remain” tags, I used #remain, #voteremain, #stay, #strongerin, #votestay, #bremain. The results are displayed below. Note that most tweets fell in the “Unclear” category: these tweets have both (and an equal number of) leave and remain tags.
Note: these tags are the most representative ones, but the actual filtering/categorization was done with regular expressions rather than individual tags.
How would such a poll look across geographies? It looks like the poll would be pro-Remain, except for Yorkshire and Essex.
Note: not every tweet comes with location information. Of all the tweets from the UK, 48.43 percent had valid city information.
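To make the categorization concrete, here is a rough sketch of a regex-based hashtag poll. The patterns are my reconstruction from the tags listed above, not the exact expressions I used, and the classify function is a hypothetical name. One more assumption: a tweet with no campaign tags at all is a tie, so it lands in “Unclear” too.

```python
import re

# Assumed patterns: any hashtag containing one of the listed stems.
LEAVE_RE = re.compile(r"#\w*(?:leave|strongerout|takecontrol)\w*", re.IGNORECASE)
REMAIN_RE = re.compile(r"#\w*(?:remain|stay|strongerin)\w*", re.IGNORECASE)

def classify(tweet_text):
    """Label a tweet Leave, Remain, or Unclear by counting campaign hashtags."""
    leave = len(LEAVE_RE.findall(tweet_text))
    remain = len(REMAIN_RE.findall(tweet_text))
    if leave > remain:
        return "Leave"
    if remain > leave:
        return "Remain"
    # Equal counts (including zero of each) are treated as Unclear.
    return "Unclear"
```

Matching on stems rather than exact tags is what lets one pattern catch #voteleave, #leaveEU and similar variants, at the cost of also catching unrelated tags that happen to contain the same stem.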
Some politicians’ names are frequently mentioned in tweets, for example Boris Johnson (“boris”), Nigel Farage (“farage”), and David Cameron (“cameron”). Out of curiosity, I wanted to see whether people call those politicians names in their tweets. The results?
Not surprisingly, first and last names go together about half the time (and people aren’t checking their spelling; see “johnson” and “johnsons”). But if we ignore the association between first and last names, and simply pair each person with the term most closely associated with them, what do we get? Well:
Boris Johnson: Ovation (followed by “speech”; his ovation speech got quite some attention, apparently)
Nigel Farage: Apologised (followed by “attacked”; not entirely surprising, I guess?)
David Cameron: Blow (Osborne is a close second, but blow is ever so slightly more interesting…)
Some issues come up frequently in this discussion. A few examples are immigration, economy, and democracy. Using similar methods, we can see which terms closely relate to those issues, based on the collection of tweets. Listed below are terms that have an association of .08 or higher with the key term. Translated into human language, that means that for every 100 tweets where the key term appears, each associated term is present in 8 or more of them.
democracy: relates to accountability, freedom, sovereignty, lords
immigration: the only term that’s closely related to immigration is reduce
economy: the list goes like this: crashes, contribute, mantra, crash, grown
There are more issues one can track, but for the association analysis to return any meaningful results, the key term needs to have a good “head count” to start with (for that matter, it can be a good idea to stem all the words before such analysis).
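The “8 out of 100 tweets” reading above can be sketched as a simple co-occurrence share. To be clear about the assumptions: this implements the plain-language interpretation given in the text (the fraction of key-term tweets that also contain another term), not necessarily the exact association measure behind my numbers, and a correlation-based measure would give different values. The function name and the simple tokenizer are illustrative only.

```python
import re
from collections import Counter

def cooccurrence_share(tweets, key_term, min_share=0.08):
    """For tweets containing key_term, return {term: share of those tweets
    that also contain term}, keeping only shares >= min_share."""
    key_term = key_term.lower()
    key_tweets = 0
    counts = Counter()
    for text in tweets:
        # Assumed tokenizer: lowercase runs of letters/apostrophes, one vote per tweet
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if key_term not in tokens:
            continue
        key_tweets += 1
        counts.update(tokens - {key_term})
    if key_tweets == 0:
        return {}
    return {
        term: count / key_tweets
        for term, count in counts.items()
        if count / key_tweets >= min_share
    }
```

With a cutoff of 0.08, a term survives only if it shows up in at least 8 percent of the tweets that mention the key term, which is why rare key terms give noisy lists.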
Heated discussions like this never lack strong emotions. Here I picked a couple of terms that have been frequently mentioned in the news over the past few days, namely fear and anger.
fear: project, mongering, hope, and prosperity
anger: leads, direct, suffering, unnecessary, anxiety, dark, angry, annoying, tea??, and westminister
and finally, disaster: putin, planet, offering, pollsters (should I mention the next in line for disaster is “trump”?)
There are some interesting patterns here (like between “disaster” and “putin”, and between “fear” and “hope”), although this is a place where 2-grams or 3-grams would provide more information. For example, take “anger” and “direct”: does anger “direct” the public to vote, or are some people “direct”ing the public’s anger?
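As a rough illustration of how 2-grams could settle that kind of question, the sketch below counts the word pairs in which a given term appears, keeping word order. I did not run this for the post; the tokenizer and the function name bigrams_with are assumptions for the example.

```python
import re
from collections import Counter

def bigrams_with(tweets, term):
    """Count adjacent word pairs (in order) that include the given term."""
    term = term.lower()
    counts = Counter()
    for text in tweets:
        tokens = re.findall(r"[a-z']+", text.lower())
        # zip pairs each token with its right-hand neighbour
        for pair in zip(tokens, tokens[1:]):
            if term in pair:
                counts[pair] += 1
    return counts
```

Because the pairs are ordered, ("anger", "direct") and ("direct", "anger") are counted separately, which is exactly the distinction the plain association list cannot make.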
Besides looking at specific emotions, we can also look at the overall sentiment of a tweet. Here I used the AFINN dictionary to calculate the score (one may want to normalize the score by the length of the tweet, which I didn’t do here). Results are shown in the plot below: nothing conclusive here, but all the tails seem to indicate some extreme emotions.
A “Twitter hashtag poll” is a toy idea, but it is interesting to me how much information descriptive analysis alone can provide. I am writing another post on categorizing tweets based on the actual tweet text; some feature engineering and predictive modeling will be involved. Stay tuned.
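Dictionary-based scoring of this kind boils down to summing per-word scores. The sketch below shows the idea, including the optional length normalization mentioned above. Note the assumptions: TOY_LEXICON is a tiny stand-in with illustrative values, not the real AFINN word list (which rates words on an integer scale from -5 to +5 and would be loaded from its file in practice), and the function name is made up.

```python
import re

# Illustrative stand-in lexicon; replace with the full AFINN list in practice.
TOY_LEXICON = {"hope": 2, "good": 3, "fear": -2, "disaster": -2, "angry": -3}

def sentiment(text, lexicon=TOY_LEXICON, normalize=False):
    """Sum the lexicon scores of the words in text.

    With normalize=True, divide by the number of tokens so that long
    tweets do not get larger scores just for being long.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0) for token in tokens)
    if normalize and tokens:
        return score / len(tokens)
    return score
```

Words missing from the lexicon count as zero, so a tweet with no scored words comes out neutral rather than being dropped.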